library(dplyr)
library(ggplot2)
biopics <- read.csv("~/edi_imp/handing/biopics.csv")
Missing data are common in many datasets. The process of filling in the missing values is known as imputation. Knowing how to impute data properly is an essential skill if you want to produce accurate predictions and excel at data analysis.
In this course, you will learn to use visualizations and statistical tests to recognize patterns in missing data. You will also meet a range of statistical and machine learning models that allow you to impute data efficiently, and you will develop the decision-making skills to select the most appropriate imputation method for each situation.
Finally, you will learn how to incorporate imputation uncertainty into your inferences and predictions, which leads to more robust and reliable results. In short, by the end of the course you will be able to impute data effectively and make informed decisions throughout the process.
Missing data are a common problem in data analysis, and handling them properly matters a great deal. Ignoring missing data points, or filling them in incorrectly, can produce unexpected model results and bias your predictions and inferences.
In this chapter, you will work with a dataset of biographical movies, known as biopics. The dataset contains each movie's earnings, subject characteristics, and other variables, but some records have missing values. Although the original data are available in the R package fivethirtyeight, in this course you will work with a preprocessed version to simplify the analysis.
In this exercise, you will get to know the dataset and fit a linear regression model to explain the earnings of biographical movies. Let's get started!
Print the first 10 observations of the biopics data and familiarize yourself with the variables. Then follow the steps below, which provide an initial analysis.
# Print first 10 observations
head(biopics, 10)
## country year earnings sub_num sub_type sub_race non_white sub_sex
## 1 UK 1971 NA 1 Criminal <NA> 0 Male
## 2 US/UK 2013 56.700 1 Other African 1 Male
## 3 US/UK 2010 18.300 1 Athlete <NA> 0 Male
## 4 Canada 2014 NA 1 Other White 0 Male
## 5 US 1998 0.537 1 Other <NA> 0 Male
## 6 US 2008 81.200 1 Other other 1 Male
## 7 UK 2002 1.130 1 Musician White 0 Male
## 8 US 2013 95.000 1 Athlete African 1 Male
## 9 US 1994 19.600 1 Athlete <NA> 0 Male
## 10 US/UK 1987 1.080 2 Author <NA> 0 Male
# Get the number of missing values per variable
biopics %>%
  is.na() %>%
  colSums()
## country year earnings sub_num sub_type sub_race non_white sub_sex
## 0 0 324 0 0 197 0 0
# Fit linear regression to predict earnings
model_1 <- lm(earnings ~ country + year + sub_type,
              data = biopics)
# Fit linear regression to predict earnings, now including sub_race
model_2 <- lm(earnings ~ country + year + sub_type + sub_race,
              data = biopics)
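Note that model_2 is fit on fewer observations than model_1: lm() silently drops every row that has a missing value in any model variable, and sub_race has many missing entries. A minimal, self-contained sketch of this effect on made-up data (df, x1, x2 are hypothetical names, not the biopics columns):

```r
# lm() uses na.action = na.omit by default, so rows with NA in any
# model variable are dropped before fitting.
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
df$x2[1:30] <- NA  # 30 missing values in x2

m1 <- lm(y ~ x1, data = df)       # fit on all 100 rows
m2 <- lm(y ~ x1 + x2, data = df)  # fit on the 70 complete rows only
nobs(m1)  # 100
nobs(m2)  # 70
```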
In this exercise, you are given six different scenarios involving missing data. The goal is to match each of them with its most likely missing data mechanism. Below are some general guidelines for the task:
These categories of missing data mechanisms matter because the methods for imputing missing values and drawing statistical inferences can differ depending on the underlying mechanism. It is therefore crucial to correctly determine the missing data mechanism in each scenario.
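To make the three mechanisms concrete, here is a toy base-R simulation (all variable names and probabilities are made up for illustration):

```r
set.seed(42)
n <- 1000
age <- rnorm(n, mean = 40, sd = 12)
income <- rnorm(n, mean = 50, sd = 10)

# MCAR: every value has the same fixed 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: missingness in income depends only on the observed variable age
income_mar <- ifelse(runif(n) < plogis((age - 40) / 5), NA, income)

# MNAR: missingness in income depends on the unobserved income itself
income_mnar <- ifelse(runif(n) < plogis((income - 50) / 5), NA, income)
```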
Great work on classifying the missing data mechanisms in the last exercise! Of all three, MAR is arguably the most important one to detect, as many imputation methods assume the data are MAR. This exercise will, therefore, focus on testing for MAR.
You will be working with the familiar biopics data. The goal is to test whether the number of missing values in earnings differs per subject’s gender. In this exercise, you will only prepare the data for the t-test. First, you will create a dummy variable indicating missingness in earnings. Then, you will split it per gender by first filtering the data to keep one of the genders, and then pulling the dummy variable. For filtering, it might be helpful to print biopics’s head() in the console and examine the gender variable.
# Create a dummy variable for missing earnings
biopics <- biopics %>%
  mutate(missing_earnings = is.na(earnings))
# Pull the missing earnings dummy for males
missing_earnings_males <- biopics %>%
  filter(sub_sex == "Male") %>%
  pull(missing_earnings)
# Pull the missing earnings dummy for females
missing_earnings_females <- biopics %>%
  filter(sub_sex == "Female") %>%
  pull(missing_earnings)
In the last exercise, you prepared two vectors with the missingness indicators for earnings per gender: missing_earnings_males and missing_earnings_females. Both are available in your workspace. You can now run the t-test to check whether their means are significantly different from each other! Let's do some serious statistical testing!
# Run the t-test
t.test(missing_earnings_males, missing_earnings_females)
##
## Welch Two Sample t-test
##
## data: missing_earnings_males and missing_earnings_females
## t = 1.1116, df = 294.39, p-value = 0.2672
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03606549 0.12969214
## sample estimates:
## mean of x mean of y
## 0.4366438 0.3898305
The aggregation plot provides the answer to the basic question one may ask about an incomplete dataset: in which combinations of variables the data are missing, and how often? It is very useful for gaining a high-level overview of the missingness patterns. For example, it makes it immediately visible if there is some combination of variables that are often missing together, which might suggest some relation between them.
In this exercise, you will first draw the aggregation plot for the biopics data and then practice making conclusions based on it. Let’s do some plotting!
# Load the VIM package
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# Draw an aggregation plot of biopics
biopics %>%
  aggr(combined = TRUE, numbers = TRUE)
Based on the aggregation plot you have just created, which of the following statements is false?
Possible Answers
10% of the observations have missing values in both earnings and sub_race.
There are more missing values in sub_race than in earnings.
42% of the observations have no missing entries.
There are exactly two variables in the biopics data that have missing values.
The aggregation plot you have drawn in the previous exercise gave you some high-level overview of the missing data. If you are interested in the interaction between specific variables, a spine plot is the way to go. It allows you to study the percentage of missing values in one variable for different values of the other, which is conceptually very similar to the t-tests you have been running in the previous lesson.
In this exercise, you will draw a spine plot to investigate the percentage of missing data in earnings for different categories of sub_race. Is there more missing data on earnings for some specific races of the movie’s main character? Let’s find out! The VIM package has already been loaded for you.
# Draw a spine plot to analyse missing values in earnings by sub_race
biopics %>%
  dplyr::select(sub_race, earnings) %>%
  spineMiss()
### Question
Based on the spine plot you have just created, which of the following statements is false?
Possible Answers
In the vast majority of movies, the main character is white.
When the main subject is African, we are the most likely to have complete earnings information.
As far as earnings and sub_race are concerned, the data seem to be MAR.
The race that appears most rarely in the data has around 40% of earnings missing. (incorrect)
The spine plot you have created in the previous exercise allows you to study missing data patterns between two variables at a time. This idea is generalized to more variables in the form of a mosaic plot.
In this exercise, you will start by creating a dummy variable indicating whether the United States was involved in the production of each movie. To do this, you will use the grepl() function, which checks if the string passed as its first argument is present in the object passed as its second argument. Then, you will draw a mosaic plot to see if the subject’s gender correlates with the amount of missing data on earnings for both US and non-US movies.
The biopics data as well as the VIM package are already loaded for you. Let’s do some exploratory plotting!
Note that a proprietary display_image() function has been created to return the output from the latest VIM package version. Make sure to expand the HTML Viewer section.
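As a quick reminder of how grepl() behaves, here is a toy vector mimicking the kind of values found in the country column:

```r
# grepl() returns TRUE for each element in which the pattern occurs
grepl("US", c("US", "US/UK", "UK", "Canada"))
# TRUE TRUE FALSE FALSE
```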
# Prepare data for plotting and draw a mosaic plot
biopics %>%
  # Create a dummy variable for US-produced movies
  mutate(is_US_movie = grepl("US", country)) %>%
  # Draw mosaic plot
  mosaicMiss(highlight = "earnings",
             plotvars = c("is_US_movie", "sub_sex"))
# Return plot from latest VIM package - expand the HTML viewer section
#display_image()
One of the most popular imputation methods is the mean imputation, in which missing values in a variable are replaced with the mean of the observed values in this variable. However, in many cases this simple approach is a poor choice. Sometimes a quick look at the data can already alert you to the dangers of mean-imputing.
In this chapter, you will be working with a subsample of the Tropical Atmosphere Ocean (tao) project data. The dataset consists of atmospheric measurements taken in two different time periods at five different locations. The data comes with the VIM package.
In this exercise you will familiarize yourself with the data and perform a simple analysis that will indicate what the consequences of mean imputation could be. Let’s take a look at the tao data!
data(tao, package = "VIM")
# Lowercase the variable names and replace dots with underscores
names(tao) <- gsub("[.]", "_", tolower(names(tao)))
# Print first 10 observations
head(tao, 10)
## year latitude longitude sea_surface_temp air_temp humidity uwind vwind
## 1 1997 0 -110 27.59 27.15 79.6 -6.4 5.4
## 2 1997 0 -110 27.55 27.02 75.8 -5.3 5.3
## 3 1997 0 -110 27.57 27.00 76.5 -5.1 4.5
## 4 1997 0 -110 27.62 26.93 76.2 -4.9 2.5
## 5 1997 0 -110 27.65 26.84 76.4 -3.5 4.1
## 6 1997 0 -110 27.83 26.94 76.7 -4.4 1.6
## 7 1997 0 -110 28.01 27.04 76.5 -2.0 3.5
## 8 1997 0 -110 28.04 27.11 78.3 -3.7 4.5
## 9 1997 0 -110 28.02 27.21 78.6 -4.2 5.0
## 10 1997 0 -110 28.05 27.25 76.9 -3.6 3.5
# Get the number of missing values per column
tao %>%
  is.na() %>%
  colSums()
## year latitude longitude sea_surface_temp
## 0 0 0 3
## air_temp humidity uwind vwind
## 81 93 0 0
# Calculate the number of missing values in air_temp per year
tao %>%
  group_by(year) %>%
  summarize(num_miss = sum(is.na(air_temp)))
## # A tibble: 2 × 2
## year num_miss
## <int> <int>
## 1 1993 4
## 2 1997 77
Mean imputation can be a risky business. If the variable you are mean-imputing is correlated with other variables, this correlation might be destroyed by the imputed values. You saw it looming in the previous exercise when you analyzed the air_temp variable.
To find out whether these concerns are valid, in this exercise you will perform mean imputation on air_temp, while also creating a binary indicator for where the values are imputed. It will come in handy in the next exercise, when you will be assessing your imputation’s performance. Let’s fill in those missing values!
tao_imp <- tao %>%
  # Create a binary indicator for missing values in air_temp
  mutate(air_temp_imp = is.na(air_temp)) %>%
  # Impute air_temp with its mean
  mutate(air_temp = ifelse(is.na(air_temp), mean(air_temp, na.rm = TRUE), air_temp))
# Print the first 10 rows of tao_imp
head(tao_imp, 10)
## year latitude longitude sea_surface_temp air_temp humidity uwind vwind
## 1 1997 0 -110 27.59 27.15 79.6 -6.4 5.4
## 2 1997 0 -110 27.55 27.02 75.8 -5.3 5.3
## 3 1997 0 -110 27.57 27.00 76.5 -5.1 4.5
## 4 1997 0 -110 27.62 26.93 76.2 -4.9 2.5
## 5 1997 0 -110 27.65 26.84 76.4 -3.5 4.1
## 6 1997 0 -110 27.83 26.94 76.7 -4.4 1.6
## 7 1997 0 -110 28.01 27.04 76.5 -2.0 3.5
## 8 1997 0 -110 28.04 27.11 78.3 -3.7 4.5
## 9 1997 0 -110 28.02 27.21 78.6 -4.2 5.0
## 10 1997 0 -110 28.05 27.25 76.9 -3.6 3.5
## air_temp_imp
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## 8 FALSE
## 9 FALSE
## 10 FALSE
In the last exercise, you have mean-imputed air_temp and added an indicator variable to denote which values were imputed, called air_temp_imp. Time to see how well this works.
Upon examining the tao data, you might have noticed that it also contains a variable called sea_surface_temp, which could reasonably be expected to be positively correlated with air_temp. If that’s the case, you would expect these two temperatures to be both high or both low at the same time. Imputing mean air temperature when the sea temperature is high or low would break this relation.
To find out, in this exercise you will select the two temperature variables and the indicator variable and use them to draw a margin plot. Let’s assess the mean imputation!
# Draw a margin plot of air_temp vs sea_surface_temp
tao_imp %>%
  select(air_temp, sea_surface_temp, air_temp_imp) %>%
  marginplot(delimiter = "imp")
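The correlation damage that the margin plot reveals can also be quantified directly. A self-contained sketch on simulated stand-ins for the two temperatures (sea and air are made-up vectors, not the real tao columns):

```r
set.seed(7)
sea <- rnorm(200, mean = 25, sd = 2)
air <- sea + rnorm(200, sd = 0.5)   # strongly correlated with sea
air_miss <- air
air_miss[sample(200, 60)] <- NA     # 30% of air goes missing

cor_complete <- cor(sea, air)
# Mean-impute the missing values and re-check the correlation
air_imp <- ifelse(is.na(air_miss), mean(air_miss, na.rm = TRUE), air_miss)
cor_imputed <- cor(sea, air_imp)
cor_imputed < cor_complete  # TRUE: mean imputation attenuates the relation
```

The imputed values form a constant horizontal band that carries no information about sea, so the correlation shrinks.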
Hot-deck imputation is a simple method that replaces every missing value in a variable by the last observed value in this variable. It’s very fast, as only one pass through the data is needed, but in its simplest form, hot-deck may sometimes break relations between the variables.
In this exercise, you will try it out on the tao dataset. You will hot-deck-impute missing values in the air temperature column air_temp and then draw a margin plot to analyze the relation between the imputed values with the sea surface temperature column sea_surface_temp. Let’s see how it works!
# Load VIM package
library(VIM)
# Impute air_temp in tao with hot-deck imputation
tao_imp <- hotdeck(tao, variable = "air_temp")
# Check the number of missing values in each variable
tao_imp %>%
  is.na() %>%
  colSums()
## year latitude longitude sea_surface_temp
## 0 0 0 3
## air_temp humidity uwind vwind
## 0 93 0 0
## air_temp_imp
## 0
# Draw a margin plot of air_temp vs sea_surface_temp
tao_imp %>%
  select(air_temp, sea_surface_temp, air_temp_imp) %>%
  marginplot(delimiter = "imp")
Does the imputation look good? Notice the observations in the top left part of the plot with imputed air_temp and high sea_surface_temp. These observations must have been preceded by ones with low air_temp in the data frame, and so after hot-deck imputation, they ended up being outliers with low air_temp and high sea_surface_temp.
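The carry-forward idea described above can be sketched in a few lines of base R. This is a simplified illustration of the mechanism, not VIM's actual hotdeck() implementation; x and locf are hypothetical names:

```r
# Last-observation-carried-forward: each NA takes the previous observed value
locf <- function(v) {
  for (i in seq_along(v)) {
    if (is.na(v[i]) && i > 1) v[i] <- v[i - 1]
  }
  v
}

x <- c(21.1, 21.3, NA, 28.9, NA)
locf(x)  # 21.1 21.3 21.3 28.9 28.9
```

If the row after a cold measurement happens to be a warm-sea row with missing air_temp, it inherits the cold value, producing exactly the outliers seen in the plot.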
One trick that may help when hot-deck imputation breaks the relations between the variables is imputing within domains. What this means is that if the variable to be imputed is correlated with another, categorical variable, one can simply run hot-deck separately for each of its categories.
For instance, you might expect air temperature to depend on time, as we are seeing the average temperatures rising due to global warming. The time indicator you have available in the tao data is a categorical variable, year. Let’s first check if the average air temperature is different in each of the two studied years and then run hot-deck within year domains. Finally, you will draw the margin plot again to assess the imputation performance.
# Calculate mean air_temp per year
tao %>%
  group_by(year) %>%
  summarize(average_air_temp = mean(air_temp, na.rm = TRUE))
## # A tibble: 2 × 2
## year average_air_temp
## <int> <dbl>
## 1 1993 23.4
## 2 1997 27.1
# Hot-deck-impute air_temp in tao by year domain
tao_imp <- hotdeck(tao, variable = "air_temp", domain_var = "year")
# Draw a margin plot of air_temp vs sea_surface_temp
tao_imp %>%
  select(air_temp, sea_surface_temp, air_temp_imp) %>%
  marginplot(delimiter = "imp")
The results look much better this time. However, if you look at the top right corner of the plot, you will see that the variance in the imputed (orange) values is somewhat larger than among the observed (blue) values. Let’s see if we can improve even further in the next exercise!
k-Nearest-Neighbors (or kNN) imputation fills the missing values in an observation based on the values coming from the k other observations that are most similar to it. The number of these similar observations, called neighbors, that are considered is a parameter that has to be chosen beforehand.
How to choose k? One way is to try different values and see how they impact the relations between the imputed and observed data.
Let’s try imputing humidity in the tao data using three different values of k and see how the imputed values fit the relation between humidity and sea_surface_temp.
Impute humidity with kNN imputation using 30 neighbors and draw a marginplot() of sea_surface_temp vs humidity.
# Impute humidity using 30 neighbors
tao_imp <- kNN(tao, k = 30, variable = "humidity")
# Draw a margin plot of sea_surface_temp vs humidity
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  marginplot(delimiter = "imp", main = "k = 30")
Impute humidity with kNN imputation using 15 neighbors and draw a marginplot of sea_surface_temp vs humidity.
# Impute humidity using 15 neighbors
tao_imp <- kNN(tao, k = 15, variable = "humidity")
# Draw a margin plot of sea_surface_temp vs humidity
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  marginplot(delimiter = "imp", main = "k = 15")
Impute humidity with kNN imputation using 5 neighbors and draw a marginplot of sea_surface_temp vs humidity.
# Impute humidity using 5 neighbors
tao_imp <- kNN(tao, k = 5, variable = "humidity")
# Draw a margin plot of sea_surface_temp vs humidity
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  marginplot(delimiter = "imp", main = "k = 5")
A variation of kNN imputation that is frequently applied uses the so-called distance-weighted aggregation. What this means is that when we aggregate the values from the neighbors to obtain a replacement for a missing value, we do so using the weighted mean and the weights are inverted distances from each neighbor. As a result, closer neighbors have more impact on the imputed value.
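A quick numeric illustration of the idea with base R's weighted.mean() (the neighbor values and distances below are made-up toy numbers):

```r
# Three neighbors' humidity values and their distances to the observation
neighbor_vals <- c(70, 80, 90)
dists <- c(1, 2, 4)

mean(neighbor_vals)                          # plain mean: 80
weighted.mean(neighbor_vals, w = 1 / dists)  # ~75.71: the closest neighbor dominates
```

With inverse-distance weights 1, 0.5 and 0.25, the result is (70 + 40 + 22.5) / 1.75 ≈ 75.71, pulled toward the nearest neighbor's value.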
In this exercise, you will apply the distance-weighted aggregation while imputing the tao data. This will only require passing two additional arguments to the kNN() function. Let’s try it out!
# Load the VIM package
library(VIM)
# Impute humidity with kNN using distance-weighted mean
tao_imp <- kNN(tao,
               k = 5,
               variable = "humidity",
               numFun = weighted.mean,
               weightDist = TRUE)
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  marginplot(delimiter = "imp")
### kNN tricks & tips II: sorting variables
As the k-Nearest Neighbors algorithm loops over the variables in the data to impute them, it computes distances between observations using other variables, some of which have already been imputed in the previous steps. This means that if the variables located earlier in the data have a lot of missing values, then the subsequent distance calculation is based on a lot of imputed values. This introduces noise to the distance calculation.
For this reason, it is a good practice to sort the variables increasingly by the number of missing values before performing kNN imputation. This way, each distance calculation is based on as much observed data and as little imputed data as possible.
Let’s try this out on the tao data!
# Get tao variable names sorted by number of NAs
vars_by_NAs <- tao %>%
  is.na() %>%
  colSums() %>%
  sort(decreasing = FALSE) %>%
  names()
# Sort tao variables and feed them to kNN imputation
tao_imp <- tao %>%
  select(all_of(vars_by_NAs)) %>%
  kNN(k = 5)
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  marginplot(delimiter = "imp")
The kNN you have just coded should be more accurate and robust against faulty imputations, so remember to sort your variables first before performing kNN imputation! This brings us to the end of this chapter. Keep it up! See you in Chapter 3, where you will learn to use statistical and machine learning models to impute missing values!
Sometimes, you can use domain knowledge, previous research or simply your common sense to describe the relations between the variables in your data. In such cases, model-based imputation is a great solution, as it allows you to impute each variable according to a statistical model that you can specify yourself, taking into account any assumptions you might have about how the variables impact each other.
For continuous variables, a popular model choice is linear regression. It doesn’t restrict you to linear relations though! You can always include a square or a logarithm of a variable in the predictors. In this exercise, you will work with the simputation package to run a single linear regression imputation on the tao data and analyze the results. Let’s give it a try!
# Load the simputation package
library(simputation)
# Impute air_temp and humidity with linear regression
formula <- air_temp + humidity ~ year + latitude + sea_surface_temp
tao_imp <- impute_lm(tao, formula)
# Check the number of missing values per column
tao_imp %>%
  is.na() %>%
  colSums()
## year latitude longitude sea_surface_temp
## 0 0 0 3
## air_temp humidity uwind vwind
## 3 2 0 0
# Print rows of tao_imp in which air_temp or humidity are still missing
tao_imp %>%
  filter(is.na(air_temp) | is.na(humidity))
## year latitude longitude sea_surface_temp air_temp humidity uwind vwind
## 1 1993 0 -95 NA NA NA -5.6 3.1
## 2 1993 0 -95 NA NA NA -6.3 0.5
## 3 1993 -2 -95 NA NA 89.9 -3.4 2.4
Linear regression fails when at least one of the predictors is missing. In this case, it was sea_surface_temp. In the next exercise, you will fix it by initializing the missing values before running impute_lm()!
As you have just seen, running impute_lm() might not fill-in all the missing values. To ensure you impute all of them, you should initialize the missing values with a simple method, such as the hot-deck imputation you learned about in the previous chapter, which simply feeds forward the last observed value.
Moreover, a single imputation is usually not enough. It is based on the basic initialized values and could be biased. A proper approach is to iterate over the variables, imputing them one at a time in the locations where they were originally missing.
In this exercise, you will first initialize the missing values with hot-deck imputation and then loop five times over air_temp and humidity from the tao data to impute them with linear regression. Let’s get to it!
# Initialize missing values with hot-deck
tao_imp <- hotdeck(tao)
# Create boolean masks for where air_temp and humidity are missing
missing_air_temp <- tao_imp$air_temp_imp
missing_humidity <- tao_imp$humidity_imp
for (i in 1:5) {
  # Set air_temp to NA in places where it was originally missing and re-impute it
  tao_imp$air_temp[missing_air_temp] <- NA
  tao_imp <- impute_lm(tao_imp, air_temp ~ year + latitude + sea_surface_temp + humidity)
  # Set humidity to NA in places where it was originally missing and re-impute it
  tao_imp$humidity[missing_humidity] <- NA
  tao_imp <- impute_lm(tao_imp, humidity ~ year + latitude + sea_surface_temp + air_temp)
}
That’s a professional approach to model-based imputation you have just coded! But how do we know that 5 is the proper number of iterations to run? Let’s look at the convergence in the next exercise!
Great job iterating over the variables in the last exercise! But how many iterations are needed? When the imputed values don’t change with the new iteration, we can stop.
You will now extend your code to compute the differences between the imputed variables in subsequent iterations. To do this, you will use the Mean Absolute Percentage Change function, defined for you as follows:
mapc <- function(a, b) { mean(abs(b - a) / a, na.rm = TRUE) }
mapc() outputs a single number that tells you how much b differs from a. You will use it to check how much the imputed variables change across iterations. Based on this, you will decide how many of them are needed!
The boolean masks missing_air_temp and missing_humidity are available for you, as is the hotdeck-initialized tao_imp data.
mapc <- function(a, b) {
  mean(abs(b - a) / a, na.rm = TRUE)
}
diff_air_temp <- c()
diff_humidity <- c()
for (i in 1:5) {
  # Assign the outcome of the previous iteration (or initialization) to prev_iter
  prev_iter <- tao_imp
  # Impute air_temp and humidity at originally missing locations
  tao_imp$air_temp[missing_air_temp] <- NA
  tao_imp <- impute_lm(tao_imp, air_temp ~ year + latitude + sea_surface_temp + humidity)
  tao_imp$humidity[missing_humidity] <- NA
  tao_imp <- impute_lm(tao_imp, humidity ~ year + latitude + sea_surface_temp + air_temp)
  # Calculate MAPC for air_temp and humidity and append them to previous iterations' MAPCs
  diff_air_temp <- c(diff_air_temp, mapc(prev_iter$air_temp, tao_imp$air_temp))
  diff_humidity <- c(diff_humidity, mapc(prev_iter$humidity, tao_imp$humidity))
}
Based on the differences stored in diff_air_temp and diff_humidity, what is the sufficient number of iterations to run?
To answer this question, you can print the two vectors in the console and analyze the numbers, or plot them using the function provided for you: just run plot_diffs(diff_air_temp, diff_humidity) in the console.
plot_diffs <- function(a, b) {
  data.frame("mapc" = c(a, b),
             "Variable" = c(rep("air_temp", length(a)),
                            rep("humidity", length(b))),
             "Iterations" = c(1:length(a), 1:length(b))) %>%
    ggplot(aes(Iterations, mapc, color = Variable)) +
    geom_line(size = 1.5) +
    ylab("Mean absolute percentage change") +
    ggtitle("Changes in imputed variables' values across iterations") +
    theme(legend.position = "bottom")
}
plot_diffs(diff_air_temp, diff_humidity)
A popular choice for imputing binary variables is logistic regression. Unfortunately, there is no function similar to impute_lm() that would do it. That’s why you’ll write such a function yourself!
Let’s call the function impute_logreg(). Its first argument will be a data frame df, whose missing values have been initialized and only containing missing values in the column to be imputed. The second argument will be a formula for the logistic regression model.
The function will do the following:
Keep the locations of missing values.
Build the model.
Make predictions.
Replace missing values with predictions.
Don't worry about the line creating imp_var - this is just a way to extract the name of the column to impute from the formula. Let's do some functional programming!
impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing
  missing_imp_var <- is.na(df[imp_var])
  # Fit logistic regression model
  logreg_model <- glm(formula, data = df, family = binomial)
  # Predict the response and convert it to 0s and 1s
  preds <- predict(logreg_model, type = "response")
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- preds[missing_imp_var]
  return(df)
}
The function you wrote is fully operational and can be plugged-in to the loop over the variables you have seen in the previous chapter, just like impute_lm() from the simputation package. Shortly, you will combine these two to impute both continuous and binary variables! But before that, let’s enhance your impute_logreg() to make it replicate the variability in imputed data better.
Simply calling predict() on a model will always return the same value for the same values of the predictors. This results in a small variability in imputed data. In order to increase it, so that the imputation replicates the variability from the original data, we can draw from the conditional distribution. What this means is that instead of always predicting 1 whenever the model outputs a probability larger than 0.5, we can draw the prediction from a binomial distribution described by the probability returned by the model.
You will work on the code you have written in the previous exercise. The following line was removed:
preds <- ifelse(preds >= 0.5, 1, 0)
Your task is to fill its place with drawing from a binomial distribution. That’s just one line of code!
impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing
  missing_imp_var <- is.na(df[imp_var])
  # Fit logistic regression model
  logreg_model <- glm(formula, data = df, family = binomial)
  # Predict the response
  preds <- predict(logreg_model, type = "response")
  # Sample the predictions from a binomial distribution
  preds <- rbinom(length(preds), size = 1, prob = preds)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- preds[missing_imp_var]
  return(df)
}
Drawing from the conditional distribution will make the imputed data’s variability more similar to the one of original, observed data. With this powerful function at hand, you are now ready to design a model-based imputation flow that takes care of both continuous and binary variables. Let’s do it in the next exercise!
Great job on writing the function to implement logistic regression imputation with drawing from conditional distribution. That’s pretty advanced statistics you have coded! In this exercise, you will combine what you learned so far about model-based imputation to impute different types of variables in the tao data.
Your task is to iterate over variables just like you have done in the previous chapter and impute two variables:
is_hot, a new binary variable created from air_temp, which is 1 if air_temp is at or above 26 degrees and 0 otherwise;
humidity, a continuous variable you are already familiar with.
You will have to use the linear regression function you have learned before, as well as your own function for logistic regression. Let's get to it!
tao$is_hot <- ifelse(tao$air_temp >= 26, 1, 0)
# Initialize missing values with hot-deck
tao_imp <- hotdeck(tao)
# Create boolean masks for where is_hot and humidity are missing
missing_is_hot <- tao_imp$is_hot_imp
missing_humidity <- tao_imp$humidity_imp
for (i in 1:3) {
  # Set is_hot to NA in places where it was originally missing and re-impute it
  tao_imp$is_hot[missing_is_hot] <- NA
  tao_imp <- impute_logreg(tao_imp, is_hot ~ sea_surface_temp)
  # Set humidity to NA in places where it was originally missing and re-impute it
  tao_imp$humidity[missing_humidity] <- NA
  tao_imp <- impute_lm(tao_imp, humidity ~ sea_surface_temp + air_temp)
}
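The impute_lm() function itself is not shown in this exercise; a minimal sketch consistent in structure with impute_logreg() above might look as follows. The rnorm() draw plays the same role as rbinom() did in the binary case: it adds residual noise so that imputed values are drawn from the conditional distribution rather than set to the point predictions.

```r
# Sketch of a linear-regression imputer with conditional draws (assumed,
# not the course's exact implementation)
impute_lm <- function(df, formula) {
  # Extract name of response variable
  imp_var <- all.vars(formula)[1]
  # Save locations where the response is missing
  missing_imp_var <- is.na(df[[imp_var]])
  # Fit linear regression on the observed rows (lm() drops NA rows)
  lm_model <- lm(formula, data = df)
  # Predict the response for all rows
  preds <- predict(lm_model, newdata = df)
  # Draw from the conditional normal distribution around the predictions
  preds <- rnorm(length(preds), mean = preds, sd = sigma(lm_model))
  # Impute missing values with the draws
  df[missing_imp_var, imp_var] <- preds[missing_imp_var]
  return(df)
}
```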
A machine learning approach to imputation might be both more accurate and easier to implement compared to traditional statistical models. First, it doesn’t require you to specify relationships between variables. Moreover, machine learning models such as random forests are able to discover highly complex, non-linear relations and exploit them to predict missing values.
In this exercise, you will get acquainted with the missForest package, which builds a separate random forest to predict missing values for each variable, one by one. You will call the imputing function on the biographic movies data, biopics, which you have worked with earlier in the course and then extract the filled-in data as well as the estimated imputation errors.
Let’s plant some random forests!
biopics <- read.csv("~/edi_imp/handing/biopics.csv")
# Load the missForest package
library(missForest)
##
## Attaching package: 'missForest'
## The following object is masked from 'package:VIM':
##
## nrmse
# Impute biopics data using missForest
imp_res <- missForest(biopics)
# Extract imputed data and check for missing values
imp_data <- imp_res$ximp
print(sum(is.na(imp_data)))
## [1] 0
# Extract and print imputation errors
imp_err <- imp_res$OOBerror
print(imp_err)
## NRMSE PFC
## 0.02040963 0.04432624
In the previous exercise, you extracted the estimated imputation errors from missForest's output. This gave you two numbers:
the normalized root mean squared error (NRMSE) for all continuous variables;
the proportion of falsely classified entries (PFC) for all categorical variables.
However, it could well be that the imputation model performs well for one continuous variable and poorly for another! To diagnose such cases, it is enough to tell missForest to produce variable-wise error estimates. This is done by setting the variablewise argument to TRUE.
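For intuition, both measures can be sketched in base R on toy vectors. The values below are hypothetical; the NRMSE formula (RMSE normalized by the variance of the true values) follows missForest's documentation, and PFC is simply the share of mismatched categorical entries.

```r
# Hypothetical true and imputed values for one continuous variable
x_true <- c(2.0, 3.5, 5.0, 4.2)
x_imp  <- c(2.1, 3.4, 5.3, 4.0)
nrmse <- sqrt(mean((x_true - x_imp)^2) / var(x_true))  # ~0.152

# Hypothetical true and imputed values for one categorical variable
f_true <- c("a", "b", "a", "b")
f_imp  <- c("a", "b", "b", "b")
pfc <- mean(f_true != f_imp)  # 0.25
```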
The biopics data and missForest package have already been loaded for you, so let’s take a closer look at the errors!
# Impute biopics data with missForest computing per-variable errors
imp_res <- missForest(biopics, variablewise = TRUE)
# Extract and print imputation errors
per_variable_errors <- imp_res$OOBerror
print(per_variable_errors)
## PFC MSE MSE MSE PFC PFC
## 0.0000000 0.0000000 1325.7078957 0.0000000 0.0000000 0.1737589
## MSE PFC
## 0.0000000 0.0000000
# Rename errors' columns to include variable names
names(per_variable_errors) <- paste(names(biopics),
                                    names(per_variable_errors),
                                    sep = "_")
# Print the renamed errors
print(per_variable_errors)
## country_PFC year_MSE earnings_MSE sub_num_MSE sub_type_PFC
## 0.0000000 0.0000000 1325.7078957 0.0000000 0.0000000
## sub_race_PFC non_white_MSE sub_sex_PFC
## 0.1737589 0.0000000 0.0000000
Notice how you produced a range of error measures instead of the default two you've seen before. You can now assess imputation quality for each variable separately! This is handy when you need to know how the model performs for a particular variable that you want to further model or analyze.
In the last video, you have seen there are two knobs you can tune to influence the performance of the random forests:
Number of decision trees in each forest.
Number of variables used for splitting within decision trees.
Increasing each of them might improve the accuracy of the imputation model, but it will also require more time to run. In this exercise, you will explore these ideas yourself by fitting missForest() to the biopics data twice with different settings. As you follow the instructions, pay attention to the errors you will be printing, and to the time the code takes to run.
# Set number of trees to 5 and number of variables used for splitting to 2
imp_res <- missForest(biopics, mtry = 2, ntree = 5)
# Print the resulting imputation errors
print(imp_res$OOBerror)
## NRMSE PFC
## 0.02330450 0.08151093
# Set number of trees to 50 and number of variables used for splitting to 6
imp_res <- missForest(biopics, mtry = 6, ntree = 50)
# Print the resulting imputation errors
print(imp_res$OOBerror)
## NRMSE PFC
## 0.02154217 0.04875887
Compare the errors and the run times of the two imputation models. Can you see a relation? There ain't no such thing as a free lunch, they say. To get a more precise imputation, you had to spend more computing time! Congratulations on finishing the chapter! See you in the final chapter, where you will learn to incorporate uncertainty from imputation into your analyses and predictions.
Whenever you perform any analysis or modeling on imputed data, you should account for the uncertainty from imputation. Running a model on a dataset imputed only once ignores the fact that imputation estimates the missing values with uncertainty. Standard errors from such a model tend to be too small. The solution to this is multiple imputation and one way to implement it is by bootstrapping.
In the upcoming exercises, you will work with the familiar biopics data. The goal is to use multiple imputation by bootstrapping and linear regression to see if, based on the data at hand, biographical movies featuring females earn less than those about males.
Let’s start with writing a function that creates a bootstrap sample, imputes it, and fits a linear regression model.
calc_gender_coef <- function(data, indices) {
  # Get bootstrap sample
  data_boot <- data[indices, ]
  # Impute with kNN imputation
  data_imp <- kNN(data_boot, k = 5)
  # Fit linear regression
  linear_model <- lm(earnings ~ sub_sex + sub_type + year, data = data_imp)
  # Extract and return gender coefficient
  gender_coefficient <- coef(linear_model)[2]
  return(gender_coefficient)
}
The calc_gender_coef() function you have just coded takes the data and bootstrap indices as inputs, and outputs our statistic of interest - the impact of gender on earnings from linear regression. You can now feed this function to the bootstrapping algorithm!
Good job writing calc_gender_coef() in the last exercise! This function creates a bootstrap sample, imputes it, and outputs the linear regression coefficient describing the impact of the movie subject being female on the movie's earnings.
In this exercise, you will use the boot package in order to obtain a bootstrapped distribution of such coefficients. The spread of this distribution will capture the uncertainty from imputation. You will also look at how the bootstrapped distribution differs from a single-time imputation and regression. Let’s do some bootstrapping!
# Load the boot library
library(boot)
# Run bootstrapping on biopics data
boot_results <- boot(biopics, statistic = calc_gender_coef, R = 50)
# Print and plot bootstrapping results
print(boot_results)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = biopics, statistic = calc_gender_coef, R = 50)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.060346 0.06569684 5.935422
plot(boot_results)
Fantastic bootstrapping! If you had run the kNN imputation and the regression analysis on the biopics data only once, you would have obtained a single coefficient estimate (called 'original' in the console output) and might have taken its sign at face value. However, correcting for the uncertainty from imputation, you have obtained a distribution that covers both negative and positive values!
Having bootstrapped the distribution of the female-effect coefficient in the last exercise, you can now use it to estimate a confidence interval. It will allow you to make the following assessment about your data: “Given the uncertainty from imputation, we are 95% sure that the female-effect on earnings is between a and b”, where a and b are the lower and upper bounds of the interval.
In the last exercise, you ran bootstrapping with R = 50 replicates. In most applications, however, this is not enough, and something like R = 1000 replicates would be preferred. First, you will look at the bootstrapped distribution to see if it looks normal. If so, you can then rely on the normal distribution to calculate the confidence interval.
# Run bootstrapping on biopics data
#boot_results <- boot(biopics, statistic = calc_gender_coef, R = 1000)
# Plot and print boot_results
plot(boot_results)
print(boot_results)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = biopics, statistic = calc_gender_coef, R = 50)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.060346 0.06569684 5.935422
# Calculate and print confidence interval
boot_ci <- boot.ci(boot_results, conf = 0.95, type = "norm")
print(boot_ci)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 50 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_results, conf = 0.95, type = "norm")
##
## Intervals :
## Level Normal
## 95% (-9.639, 13.628 )
## Calculations and Intervals on Original Scale
The bootstrap replicates show that movies with female leads could earn either less or more: the interval covers both negative and positive values. Accounting for the uncertainty from imputation, you cannot be sure about the direction of this relation, even though a single analysis might suggest otherwise.
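For reference, the normal-type interval printed above can be reproduced by hand from the bootstrap statistics: boot.ci() with type = "norm" subtracts the bootstrap bias from the original estimate and adds a margin of the normal quantile times the bootstrap standard error.

```r
# Reproducing the normal-type interval from the printed bootstrap statistics
t0   <- 2.060346      # original statistic
bias <- 0.06569684    # bootstrap bias
se   <- 5.935422      # bootstrap standard error
ci <- (t0 - bias) + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)  # -9.639 13.628, matching the boot.ci() output
```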
Multiple imputation by chained equations, or MICE, allows us to estimate the uncertainty from imputation by imputing a data set multiple times with model-based imputation, while drawing from conditional distributions. This way, each imputed data set is slightly different. Then, an analysis is conducted on each of them and the results are pooled together, yielding the quantities of interest, alongside their confidence intervals that reflect the imputation uncertainty.
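The pooling step combines the per-imputation results using Rubin's rules. A minimal base-R sketch for a single coefficient, with hypothetical per-imputation estimates (illustrative numbers, not from the exercise):

```r
# Hypothetical estimates and within-imputation variances from m = 5 imputations
est        <- c(1.2, 1.5, 1.1, 1.4, 1.3)
var_within <- c(0.20, 0.22, 0.19, 0.21, 0.20)
m <- length(est)

q_bar <- mean(est)                    # pooled point estimate
w_bar <- mean(var_within)             # average within-imputation variance
b     <- var(est)                     # between-imputation variance
total_var <- w_bar + (1 + 1 / m) * b  # Rubin's total variance

c(estimate = q_bar, se = sqrt(total_var))
```

The between-imputation variance b is what captures the imputation uncertainty: the more the imputed data sets disagree, the wider the pooled standard error.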
In this exercise, you will practice the typical MICE flow: mice() - with() - pool(). You will perform a regression analysis on the biopics data to see which subject occupation, sub_type, is associated with the highest movie earnings. Let's play with mice!
# Load mice package
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
# Impute biopics with mice using 5 imputations
biopics_multiimp <- mice(biopics, m = 5, seed = 3108)
##
## iter imp variable
## 1 1 earnings sub_race
## 1 2 earnings sub_race
## 1 3 earnings sub_race
## 1 4 earnings sub_race
## 1 5 earnings sub_race
## 2 1 earnings sub_race
## 2 2 earnings sub_race
## 2 3 earnings sub_race
## 2 4 earnings sub_race
## 2 5 earnings sub_race
## 3 1 earnings sub_race
## 3 2 earnings sub_race
## 3 3 earnings sub_race
## 3 4 earnings sub_race
## 3 5 earnings sub_race
## 4 1 earnings sub_race
## 4 2 earnings sub_race
## 4 3 earnings sub_race
## 4 4 earnings sub_race
## 4 5 earnings sub_race
## 5 1 earnings sub_race
## 5 2 earnings sub_race
## 5 3 earnings sub_race
## 5 4 earnings sub_race
## 5 5 earnings sub_race
## Warning: Number of logged events: 50
# Fit linear regression to each imputed data set
lm_multiimp <- with(biopics_multiimp, lm(earnings ~ year + sub_type))
# Pool and summarize regression results
lm_pooled <- pool(lm_multiimp)
summary(lm_pooled, conf.int = TRUE, conf.level = 0.95)
## term estimate std.error statistic
## 1 (Intercept) -287.8866750 490.062824 -0.58744851
## 2 year 0.1612176 0.239277 0.67376974
## 3 sub_typeAcademic (Philosopher) -32.8423364 39.593926 -0.82947915
## 4 sub_typeActivist -17.4281787 16.052579 -1.08569339
## 5 sub_typeActor -30.7740287 19.378762 -1.58802862
## 6 sub_typeActress -27.3033456 21.249448 -1.28489670
## 7 sub_typeActress / activist 16.0962907 39.192849 0.41069458
## 8 sub_typeArtist -27.4779956 18.148136 -1.51409465
## 9 sub_typeAthlete -4.2336944 12.503822 -0.33859202
## 10 sub_typeAthlete / military 79.1943734 38.055447 2.08102595
## 11 sub_typeAuthor -23.6540827 18.471621 -1.28056347
## 12 sub_typeAuthor (poet) -23.4506193 19.985477 -1.17338304
## 13 sub_typeComedian -22.9330598 21.576033 -1.06289508
## 14 sub_typeCriminal -2.6478096 16.754147 -0.15803906
## 15 sub_typeGovernment -5.0221576 21.879188 -0.22954040
## 16 sub_typeHistorical -9.3734433 19.183957 -0.48860845
## 17 sub_typeJournalist -26.5071032 29.005345 -0.91386960
## 18 sub_typeMedia -11.5302020 26.461566 -0.43573393
## 19 sub_typeMedicine 4.7788761 15.006237 0.31845932
## 20 sub_typeMilitary 10.9417685 19.325322 0.56618816
## 21 sub_typeMilitary / activist 37.2248141 39.752579 0.93641257
## 22 sub_typeMusician -19.7931797 17.320867 -1.14273611
## 23 sub_typeOther -15.5895605 16.044509 -0.97164460
## 24 sub_typePolitician -13.6751859 39.752579 -0.34400752
## 25 sub_typeSinger 0.6677979 17.844592 0.03742298
## 26 sub_typeTeacher 51.1575084 39.268925 1.30274786
## 27 sub_typeWorld leader 5.9976436 16.882980 0.35524793
## df p.value 2.5 % 97.5 %
## 1 3.983993 0.58858265 -1650.677928 1074.9045785
## 2 4.030421 0.53712441 -0.501149 0.8235843
## 3 139.785256 0.40824758 -111.122704 45.4380311
## 4 10.576004 0.30174143 -52.933253 18.0768961
## 5 10.847500 0.14097812 -73.499669 11.9516115
## 6 7.876739 0.23532207 -76.438456 21.8317642
## 7 169.869958 0.68181408 -61.271473 93.4640548
## 8 8.280382 0.16719720 -69.082290 14.1262992
## 9 15.246283 0.73953513 -30.847513 22.3801244
## 10 336.810947 0.03818668 4.338081 154.0506662
## 11 7.047583 0.24087818 -67.272814 19.9646489
## 12 10.612273 0.26630514 -67.635233 20.7339945
## 13 16.686775 0.30297041 -68.519689 22.6535693
## 14 7.275709 0.87872330 -41.962758 36.6671385
## 15 81.573759 0.81902349 -48.550237 38.5059216
## 16 6.172715 0.64199292 -55.998343 37.2514567
## 17 113.284343 0.36272632 -83.970362 30.9561559
## 18 6.128774 0.67795894 -75.950965 52.8905609
## 19 247.938063 0.75040464 -24.777080 34.3348320
## 20 6.645835 0.58986247 -35.253693 57.1372305
## 21 130.043524 0.35079655 -41.420661 115.8702894
## 22 6.996299 0.29074124 -60.754913 21.1685537
## 23 7.168854 0.36286255 -53.348394 22.1692725
## 24 130.043524 0.73139625 -92.320661 64.9702894
## 25 10.377953 0.97085786 -38.897026 40.2326216
## 26 163.415623 0.19449369 -26.382404 128.6974204
## 27 14.000056 0.72769899 -30.212733 42.2080205
You have followed the mice - with - pool flow to impute, model, and pool the results. Now take a look at the console output: a couple of sub_types have a positive estimated impact on earnings. However, accounting for imputation uncertainty at the 95% confidence level, we cannot be sure of most of these effects, as their lower bounds are negative! There is one exception: for sub_typeAthlete / military, both the lower and upper bounds are positive. What we can say with some confidence is thus that movies about military athletes tend to earn more!
MICE creates a separate imputation model for each variable in the data. What kind of model it is depends on the type of the variable in question. A popular way to specify the kinds of models to use is to set a default model for each of the four variable types.
You can do this by passing the defaultMethod argument to mice(), which should be a vector of length 4 containing the default imputation methods for: numeric variables, binary factors, unordered factors with more than two levels, and ordered factors with more than two levels, in that order.
In this exercise, you will take advantage of mice’s documentation to view the list of available methods and to pick the desired ones for the algorithm to use. Let’s do some model selection!
# Impute biopics using the methods specified in the instruction
biopics_multiimp <- mice(biopics, m = 20,
                         defaultMethod = c("cart", "lda", "pmm", "polr"))
##
## iter imp variable
## 1 1 earnings sub_race
## 1 2 earnings sub_race
## 1 3 earnings sub_race
## 1 4 earnings sub_race
## 1 5 earnings sub_race
## 1 6 earnings sub_race
## 1 7 earnings sub_race
## 1 8 earnings sub_race
## 1 9 earnings sub_race
## 1 10 earnings sub_race
## 1 11 earnings sub_race
## 1 12 earnings sub_race
## 1 13 earnings sub_race
## 1 14 earnings sub_race
## 1 15 earnings sub_race
## 1 16 earnings sub_race
## 1 17 earnings sub_race
## 1 18 earnings sub_race
## 1 19 earnings sub_race
## 1 20 earnings sub_race
## (iterations 2 through 5 print the same log for each of the 20 imputations)
## Warning: Number of logged events: 200
# Print biopics_multiimp
print(biopics_multiimp)
## Class: mids
## Number of multiple imputations: 20
## Imputation methods:
## country year earnings sub_num sub_type sub_race non_white sub_sex
## "" "" "cart" "" "" "pmm" "" ""
## PredictorMatrix:
## country year earnings sub_num sub_type sub_race non_white sub_sex
## country 0 1 1 1 1 1 1 1
## year 1 0 1 1 1 1 1 1
## earnings 1 1 0 1 1 1 1 1
## sub_num 1 1 1 0 1 1 1 1
## sub_type 1 1 1 1 0 1 1 1
## sub_race 1 1 1 1 1 0 1 1
## Number of logged events: 200
## it im dep meth out
## 1 1 1 earnings cart sub_typeAcademic (Philosopher)
## 2 1 1 sub_race pmm sub_typeMilitary / activist
## 3 1 2 earnings cart sub_typeAcademic (Philosopher)
## 4 1 2 sub_race pmm sub_typeMilitary / activist
## 5 1 3 earnings cart sub_typeAcademic (Philosopher)
## 6 1 3 sub_race pmm sub_typeMilitary / activist
The ability to specify imputation models might come in handy when you see some specific methods underperforming. Another factor influencing how the imputation methods work is the set of predictors they use. Let’s look at how to set these in the next exercise.
An important decision when using model-based imputation is which variables to include as predictors, and in which models. In mice(), this is governed by the predictor matrix; by default, all variables are used to impute all others.
If the data contain many variables, or there is little time for proper model selection, you can use mice's functionality to create a predictor matrix based on the correlations between the variables. This matrix can then be passed to mice(). In this exercise, you will practice exactly this: you will first build a predictor matrix such that each variable will be imputed using the variables most correlated with it; then, you will feed your predictor matrix to the imputing function. Let's try this simple model selection!
# Create predictor matrix with minimum correlation of 0.1
pred_mat <- quickpred(biopics, mincor = 0.1)
# Impute biopics with mice
biopics_multiimp <- mice(biopics,
                         m = 10,
                         predictorMatrix = pred_mat,
                         seed = 3108)
##
## iter imp variable
## 1 1 earnings sub_race
## 1 2 earnings sub_race
## 1 3 earnings sub_race
## 1 4 earnings sub_race
## 1 5 earnings sub_race
## 1 6 earnings sub_race
## 1 7 earnings sub_race
## 1 8 earnings sub_race
## 1 9 earnings sub_race
## 1 10 earnings sub_race
## (iterations 2 through 5 print the same log for each of the 10 imputations)
## Warning: Number of logged events: 50
# Print biopics_multiimp
print(biopics_multiimp)
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## country year earnings sub_num sub_type sub_race non_white sub_sex
## "" "" "pmm" "" "" "polyreg" "" ""
## PredictorMatrix:
## country year earnings sub_num sub_type sub_race non_white sub_sex
## country 0 0 0 0 0 0 0 0
## year 0 0 0 0 0 0 0 0
## earnings 1 1 0 0 0 1 1 0
## sub_num 0 0 0 0 0 0 0 0
## sub_type 0 0 0 0 0 0 0 0
## sub_race 0 1 1 1 1 0 1 0
## Number of logged events: 50
## it im dep meth out
## 1 1 1 sub_race polyreg sub_typeMilitary / activist
## 2 1 2 sub_race polyreg sub_typeMilitary / activist
## 3 1 3 sub_race polyreg sub_typeMilitary / activist
## 4 1 4 sub_race polyreg sub_typeMilitary / activist
## 5 1 5 sub_race polyreg sub_typeMilitary / activist
## 6 1 6 sub_race polyreg sub_typeMilitary / activist
The first step in working with incomplete data is to gain some insights into the missingness patterns, and a good way to do this is with visualizations. You will start your analysis of the africa data by employing the VIM package to create two visualizations: the aggregation plot and the spine plot. They will tell you how much data is missing, in which variables and configurations, and whether we can say something about the missing data mechanism. Let's kick off with some plotting!
africa <- read.csv("~/edi_imp/handing/africa.csv", sep = ";")
# Load VIM
library(VIM)
# Draw a combined aggregation plot of africa
africa %>%
aggr(combined = TRUE, numbers = TRUE)
# Draw a spine plot of country vs trade
africa %>%
select(country, trade) %>%
spineMiss()
Based on the spine plot you have just created, which of the following statements is TRUE?
Possible Answers
There is more missing data in trade for Cameroon than for Burundi.
There is more missing data in gdp_pc for Burundi than for Cameroon.
There is no country with more than 20% trade values missing.
Correct, there are not that many missing values! Also, notice from the spine plot that the africa data seem to be MAR, at least with respect to GDP and country, which means they can be imputed.
Good job on visualizing missing data in the previous exercise! You have discovered there are some missing entries in GDP per capita, gdp_pc, and in trade as a percentage of GDP, trade. Also, you suspect the data are MAR, and thus imputable. In this exercise, you will make use of multiple imputation from the mice package to impute the africa data. Then, you will draw a strip plot of gdp_pc vs trade to check whether the imputed data preserve the relation between these variables. Let mice do the job!
# Load mice
library(mice)
# Impute africa with mice
africa_multiimp <- mice(africa, m = 5, defaultMethod = "cart", seed = 3108)
##
## iter imp variable
## 1 1 gdp_pc trade
## 1 2 gdp_pc trade
## 1 3 gdp_pc trade
## 1 4 gdp_pc trade
## 1 5 gdp_pc trade
## 2 1 gdp_pc trade
## 2 2 gdp_pc trade
## 2 3 gdp_pc trade
## 2 4 gdp_pc trade
## 2 5 gdp_pc trade
## 3 1 gdp_pc trade
## 3 2 gdp_pc trade
## 3 3 gdp_pc trade
## 3 4 gdp_pc trade
## 3 5 gdp_pc trade
## 4 1 gdp_pc trade
## 4 2 gdp_pc trade
## 4 3 gdp_pc trade
## 4 4 gdp_pc trade
## 4 5 gdp_pc trade
## 5 1 gdp_pc trade
## 5 2 gdp_pc trade
## 5 3 gdp_pc trade
## 5 4 gdp_pc trade
## 5 5 gdp_pc trade
# Draw a stripplot of gdp_pc versus trade
stripplot(africa_multiimp, gdp_pc ~ trade | .imp, pch = 20, cex = 1)
It seems the imputation works well: there are small clusters in the scatter plots, likely corresponding to different countries. Each imputed data point fits into one of the clusters, instead of being an outlier somewhere between the clusters. Having done the imputation, you can now proceed to modeling!
In the last exercise, you have run mice to multiply impute the africa data. In this one, you will implement the other two steps of the mice - with - pool flow you’ve learned about earlier in the course. The model of interest is a linear regression that explains the GDP, gdp_pc, with other variables. You are particularly interested in the coefficient of civil liberties, civlib. Is more liberty associated with more economic growth once we incorporate the uncertainty from imputation? Let’s find out!
# Fit linear regression to each imputed data set
lm_multiimp <- with(africa_multiimp, lm(gdp_pc ~ country + year + trade + infl + civlib))
# Pool regression results
lm_pooled <- pool(lm_multiimp)
# Summarize pooled results
summary(lm_pooled, conf.int = TRUE, conf.level = 0.9)
## term estimate std.error statistic df p.value
## 1 (Intercept) -30005.113539 5592.2088438 -5.3655209 107.8640 4.654452e-07
## 2 countryBurundi 70.655824 61.9717253 1.1401300 107.1336 2.567747e-01
## 3 countryCameroon 653.969299 58.3491217 11.2078688 106.1787 0.000000e+00
## 4 countryCongo 1337.668745 112.9974527 11.8380434 107.8509 0.000000e+00
## 5 countrySenegal 522.377263 78.6118111 6.6450226 107.4387 1.293220e-09
## 6 countryZambia 410.935079 83.5292689 4.9196537 107.8369 3.128698e-06
## 7 year 15.299939 2.8128543 5.4392932 107.8559 3.368002e-07
## 8 trade 5.059562 1.5814802 3.1992573 107.4800 1.811115e-03
## 9 infl -4.347048 0.9837546 -4.4188336 108.0143 2.369012e-05
## 10 civlib -40.661539 132.1221617 -0.3077571 106.5790 7.588678e-01
## 5 % 95 %
## 1 -39283.165341 -20727.061736
## 2 -32.167739 173.479388
## 3 557.148769 750.789829
## 4 1150.194108 1525.143382
## 5 391.947680 652.806846
## 6 272.351096 549.519062
## 7 10.633121 19.966758
## 8 2.435642 7.683482
## 9 -5.979179 -2.714917
## 10 -259.888744 178.565667
Based on the summary of the pooled regression results that you have just printed to the console, which of the following statements about the civil liberties in Africa is false?
Possible Answers
On average, more liberty is associated with lower GDP.
There is 5% chance that civlib's coefficient in regression on gdp_pc is larger than 178.565667.
Based on the 90% confidence interval, we are sure that civlib’s impact on gdp_pc is positive.
Correct, this one is false! Since the lower and upper bounds have different signs, we cannot be sure of the direction of the effect. Congratulations, you have come a long way and learned a lot. Well done! Let’s sum it all up in the final video of the course.
Final remarks Congratulations on making it to the end of the course! Let's recap what you have learned.
What you know In Chapter 1, you saw how modeling incomplete data can be troublesome and that it requires a special treatment. You also learned about the three missing data mechanisms and how to gain insights into missing data patterns by using visualizations from the VIM package and statistical tests. In Chapter 2, we covered donor-based imputation. First, you’ve seen why mean imputation is typically a poor choice. Then, you learned to use hot-deck and kNN imputation from the VIM package, along with some tricks that make them work even better.
What you know In Chapter 3, you learned the model-based imputation approach of looping over variables and imputing them until convergence. You’ve also seen how to increase variability of imputed data by drawing from conditional distributions. Finally, you’ve learned about tree-based imputation with the missForest package. In Chapter 4, you’ve seen two methods of incorporating the uncertainty from imputation into modeling: bootstrapping using the boot package and multiple imputation by chained equations using the mice package.
Which imputation method to choose? You learned about a lot of different imputation methods. Which one to choose, and when? Here are some loose guidelines. If you have a lot of data, or if your imputation has to run in real time in production, you're best off with the quick hot-deck. If you suspect specific relations between the variables based on domain knowledge, you can use this knowledge in model-based imputation. Otherwise, if the imputation need not be very fast and the relations between the variables are not obvious, a machine learning approach such as kNN or tree-based imputation is your best bet.
How to estimate uncertainty from imputation? You’ve also learned about two methods to estimate the uncertainty from imputation: bootstrapping and mice. Which one to pick? Again, let me offer some loose guidelines. If your application has to be relatively fast or if you have ideas about which models to use and how to specify them, then mice is the way to go. Otherwise, if you would like to use a non-parametric method such as kNN or hot-deck, or if you simply don’t want to worry about the assumptions of specific models, then the bootstrap might be a better choice.
Next steps One last thing before you go. If you would like to dig even deeper into the arcane imputation knowledge, I suggest you google miceVignettes. The authors of the mice package provide a series of six vignettes with R code and examples which touch upon issues such as passive imputation, post-processing of imputed data, imputing multi-level data or sensitivity analysis. If you easily absorb knowledge from books, then “Flexible Imputation of Missing Data” by Stef van Buuren is a must-read. It also uses the mice package. Finally, there are some other great R packages worth exploring that we had no time to cover in this course, such as Amelia or mi, which allow for imputing time series and panel data.
Congratulations and good luck! Once again, congratulations on finishing the course and thanks for staying with me. I hope the knowledge and skills you’ve gained will make your work with incomplete data easier and more productive. Good luck!